Inferring Cellular Developmental Time

“From Multivariate to Longitudinal Data”

April 11, 2017

Overview


  • Motivation
    • Single cell RNA-Seq

  • Model Dataset
    • EDA

  • Methods of estimating developmental time (“interpreting reduced dimensions”)
    • PCA
    • Probabilistic PCA
    • Bayesian PCA
    • Gaussian Process Latent Variable Modeling

Trajectories...

~3 million cells
~20 billion cells
~50 trillion cells

  • What are the key points in development for disease?

Questions from a developmental biology perspective


  • What happens in a cell such that it becomes a brain, toe, or a heart?
    • When do these decisions get made? “Who” makes them?

  • How do the developmental trajectories of disease (leukemia / schizophrenia) differ from healthy individuals?

  • Can we identify important transition points and the genetic signature underlying them?

Waddington Landscape



Some cancers regain stemness programs


Stergachis et al., Cell 2013

“Stem cell-like” covariate is important in AML




Corces et al., Nature Genetics, 2016

How do we characterize single cells?


Proserpio and Mahata, Immunology 2015












EDA


> dim(deng)

[1] 17585   255

> sum(deng == 0) / prod(dim(deng))

[1] 0.5019552

> head(sample(colnames(deng)))

[1] "earlyblast" "16cell"     "4cell"      "midblast"   "lateblast"  "16cell"   

> head(sample(rownames(deng)))

[1] "Gm7073"  "Mir697"  "Uqcrc2"  "Ap2m1"   "Slc10a3" "Ccr1"   

[Figure panels: example gene expression patterns across developmental time: linearly increasing; dropoff; linearly decreasing; varying, no clear effect; V-shaped; transition on/off; sigmoidal with dropout]

Overall picture


\( \forall \) gene \( g \), fit an OLS regression with the known timepoint \( t \) for each cell–

\[ \log_2(g +1) \sim \beta_0 + \beta_1 t \]
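As a minimal base-R sketch (using a toy stand-in for the `deng` matrix and stage labels, since the real data aren't reproduced here), the per-gene fits might look like:

```r
# Toy stand-in for the Deng data: genes x cells counts plus a known
# timepoint per cell (shapes and names assumed, for illustration only)
set.seed(1)
deng <- matrix(rpois(200 * 30, lambda = 5), nrow = 200)
time <- rep(1:6, each = 5)  # known developmental stage per cell

# For every gene g, fit log2(g + 1) ~ beta0 + beta1 * t and keep R^2
r2_per_gene <- apply(deng, 1, function(g) {
  summary(lm(log2(g + 1) ~ time))$r.squared
})
head(sort(r2_per_gene, decreasing = TRUE))  # genes best explained by time
```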

Permuted


\( \forall \) gene \( g \), fit an OLS regression with the permuted timepoint \( t^* \) for each cell–

\[ \log_2(g +1) \sim \beta_0 + \beta_1 t^* \]

Permuted


\( \forall \) gene \( g \), fit an OLS regression with the permuted factor timepoint \( t^{**} \) for each cell–

\[ \log_2(g +1) \sim \beta_0 + \beta_1 t^{**} \]
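A sketch of the permutation comparison, again with a toy stand-in for `deng` (all names assumed): per-gene \( R^2 \) values under permuted labels form a null distribution to compare the observed fits against.

```r
set.seed(1)
deng <- matrix(rpois(100 * 30, lambda = 5), nrow = 100)  # toy genes x cells
time <- rep(1:6, each = 5)                               # known stage per cell

# Per-gene R^2 for an arbitrary timepoint vector (numeric or factor)
r2 <- function(t) {
  apply(deng, 1, function(g) summary(lm(log2(g + 1) ~ t))$r.squared)
}

observed   <- r2(time)                  # known t
permuted   <- r2(sample(time))          # t*:  permuted timepoints
permuted_f <- r2(factor(sample(time)))  # t**: permuted stages as a factor

# With real data, observed R^2 should exceed the permuted null; with this
# toy matrix (no time signal) the distributions look alike
quantile(observed, c(0.5, 0.9))
quantile(permuted, c(0.5, 0.9))
```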

Statement of problem


Given a matrix \( Y \) of \( D \) genes (features) by \( n \) samples (cells), determine a latent vector \( P \) of dimension \( 1 \) x \( n \) that reflects the (smooth) developmental trajectory of the \( n \) cells, inferred from the variance in the \( D \) genes.

  • \( D \) can be thought of as a higher-dimensional space, and we want to infer \( d \) (\( d \) < \( D \)) latent variables from the gene data.
    • One of the \( d \) latent variables ideally reflects the developmental ordering.

Perfect latent variable


> library(magrittr)  # provides %>%

> cor(runif(length(time)) %>% sort(), as.numeric(time) %>% sort())^2

[1] 0.8799669
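For intuition on why even this “perfect” latent variable tops out below \( R^2 = 1 \): the stage labels are a step function of the true ordering, so a smoothly increasing latent variable cannot match them exactly. A base-R sketch with an assumed stand-in for `time`:

```r
# Toy stand-in for the stage labels (names assumed): many cells share a stage
set.seed(1)
stages <- c("zygote", "2cell", "4cell", "8cell", "16cell", "blast")
time   <- factor(rep(stages, each = 40), levels = stages)

latent <- sort(runif(length(time)))        # perfectly ordered latent values
r2 <- cor(latent, sort(as.numeric(time)))^2
r2  # < 1: the labels are piecewise constant, the latent variable is smooth
```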

PCA


Computing PCA


\( Y \) is a \( D \) x \( n \) matrix. Compute the covariance matrix–

\[ \Sigma = E(YY^{T}) - \mu \mu^T \] where \( \mu = E(Y) \).

Then compute the spectral decomposition of the covariance matrix, \( \Sigma \)

\[ \Sigma a_j = \lambda_j a_j \] for \( j \in (1, ..., D) \)

Then the \( a_j \) are the eigenvectors of the covariance matrix \( \Sigma \). Note the ordering of \( j \) is meaningful–

\[ \lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D \geq 0 \]
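This is exactly what base R's `eigen()` computes; a quick sketch (toy data, with rows as observations) checking the spectral decomposition against `prcomp()`:

```r
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)  # n = 100 observations, D = 5 features

S <- cov(X)    # empirical covariance (centers internally)
e <- eigen(S)  # Sigma a_j = lambda_j a_j, lambda sorted decreasing
p <- prcomp(X) # reference implementation (SVD-based)

# Eigenvalues lambda_j equal the squared PC standard deviations
all.equal(e$values, p$sdev^2)
# Eigenvectors a_j match the loadings up to sign
max(abs(abs(e$vectors) - abs(p$rotation)))
```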


> irlba::prcomp_irlba()

> prcomp()

> princomp()

PCA


> cor(pca$rotation[,1], as.numeric(time))^2

[1] 0.5933009

Correlation with all PCs

Pause...




PCA by itself isn't satisfactory…



Ideas for improvements?

Improving on PCA



1) Quantifying uncertainty

2) Non-linear latent variable inference

Uncertainty




From Buenrostro et al., bioRxiv 2017

Manifold / Non-linear dimension learning


  • Next several images taken from slides by Guy Wolf (Yale)

  • These slides can be found here





Tackling 1 and building to 2









Slide from Neil Lawrence

Computing Probabilistic PCA -- Factor Analysis



\( Y \) is a \( D \) x \( n \) matrix. \( x \) is a \( d \) x \( n \) matrix (\( d < D \)). We want to relate \( x \) and \( Y \), and we assume the relationship is linear–

\[ Y = Wx + \epsilon \] where \( W \) is a \( D \) x \( d \) matrix.


By convention (after centering the data), \[ x \sim \mathcal{N} (0, I) \] where \( I \) is the identity matrix of dimension \( d \) x \( d \).


Specify the error, \[ \epsilon \sim \mathcal{N} (0, \Psi) \] where we assume \( \Psi \) to be a diagonal matrix \( D \) x \( D \).

  • Assuming a diagonal structure for \( \Psi \) means that all observed correlation between features is due to the latent variables.
  • In other words, the \( Y_i \) for \( i \in (1, ..., n) \) are conditionally independent given \( x \)

Computing Probabilistic PCA -- Factor Analysis II


Then for each \( Y_i \), \( i \in (1, ..., n) \), \[ Y_i | x_i \sim \mathcal{N} (W x_i, \Psi) \]
Then we can integrate out the latent variables–

\[ p(Y_i | W, \Psi) = \int p( Y_i | x_i , W, \Psi) p(x_i) dx_i \]

Under this specification, we can write the MVN distribution for our observations–

\[ Y_i \sim \mathcal{N} (0, WW^T + \Psi) \]
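A small simulation (toy dimensions, all values assumed) makes the marginal covariance concrete: generate \( Y = Wx + \epsilon \) and check that the empirical covariance approaches \( WW^T + \Psi \).

```r
set.seed(1)
D <- 5; d <- 2; n <- 20000

W   <- matrix(rnorm(D * d), D, d)              # loadings
psi <- runif(D, 0.1, 0.5)                      # diagonal of Psi
x   <- matrix(rnorm(d * n), d, n)              # x ~ N(0, I)
eps <- matrix(rnorm(D * n), D, n) * sqrt(psi)  # row i scaled by sqrt(psi_i)

Y <- W %*% x + eps             # D x n observations

emp  <- cov(t(Y))              # empirical D x D covariance
theo <- W %*% t(W) + diag(psi) # W W^T + Psi
max(abs(emp - theo))           # shrinks as n grows
```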

Computing Probabilistic PCA -- Factor Analysis III


Young-Whittle Factor Analysis (1950s)
For \( \psi_i \), the elements of the diagonal of \( \Psi \), add the constraint \( \psi_i = \sigma^2 \)
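Under the \( \psi_i = \sigma^2 \) constraint, the maximum-likelihood fit has a closed form (Tipping & Bishop's probabilistic PCA): \( \sigma^2 \) is the average of the discarded eigenvalues and \( W = U_d(\Lambda_d - \sigma^2 I)^{1/2} \). A base-R sketch on toy data (rows as observations here):

```r
set.seed(1)
n <- 200; D <- 6; d <- 2
Y <- matrix(rnorm(n * D), nrow = n)  # toy n x D data

e <- eigen(cov(Y))                   # eigendecomposition of the covariance
sigma2 <- mean(e$values[(d + 1):D])  # average of the discarded eigenvalues
W <- e$vectors[, 1:d] %*% diag(sqrt(e$values[1:d] - sigma2))
dim(W)                               # D x d maximum-likelihood loadings
```

Strictly, \( W \) is identified only up to an orthogonal rotation of its columns.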

Bayesian PCA





  • \( \alpha_i \) represents a precision (inverse variance) on the \( i \)-th latent factor; for large \( \alpha_i \), the contribution of that factor is effectively switched off.
  • Solved by EM or a similar algorithm

Bayesian / Probabilistic PCA


library(pcaMethods)

pca()

resPCA   <- pca(data, method="svd",       center=FALSE, nPcs=5)
resPPCA  <- pca(data, method="ppca",      center=FALSE, nPcs=5)
resBPCA  <- pca(data, method="bpca",      center=FALSE, nPcs=5)
resSVDI  <- pca(data, method="svdImpute", center=FALSE, nPcs=5)



GPLVM


library(pseudogp)

fit <- fitPseudotime(data, smoothing_alpha = 30, smoothing_beta = 6,
                        iter = 1000, chains = 1)

posteriorBoxplot(fit)

GPLVM -- Notable application



GPLVM



> cor(gplvm_means, as.numeric(time))^2

[1] 0.783522

GPLVM versus PCA 1



> cor(pca$rotation[,1], as.numeric(time))^2

[1] 0.5933009

> cor(gplvm_means, as.numeric(time))^2

[1] 0.783522

GPLVM versus PCA 2



Wrapping up...


  • GPLVMs provide both a probabilistic and a non-linear latent variable inference framework
    • Most other published algorithms provide only point estimates, making uncertainty difficult to ascertain

  • Under this framework, Bayesian priors are allowed, which have been useful in published studies

  • Tools and methods useful in many data analysis contexts (including machine learning)
    • Motivated here by single cell analysis

Detailed look at early embryo neurogenesis?





Thanks!